Coresets for k-Means and k-Median Clustering and their Applications

Authors

  • Sariel Har-Peled
  • Soham Mazumdar
Abstract

In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a set P of n points in R^d, one can compute a weighted set S ⊆ P, of size O(k ε^{-d} log n), such that one can compute the k-median/means clustering on S instead of on P, and get a (1 + ε)-approximation. As a result, we improve the fastest known algorithms for (1 + ε)-approximate k-means and k-median. Our algorithms have linear running time for a fixed k and ε. In addition, we can maintain the (1 + ε)-approximate k-median or k-means clustering of a stream when points are only being inserted, using polylogarithmic space and update time.
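
To make the idea of clustering on a weighted subset S instead of on P concrete, here is a minimal Python sketch. It is not the paper's construction: it assumes a plain uniform grid as a simplified stand-in, keeps one input point per occupied cell (so S ⊆ P), and weights that point by the number of points in its cell. The names `grid_coreset`, `kmeans_cost`, and the `cell` parameter are hypothetical and chosen only for illustration, and the sketch carries no (1 + ε) guarantee.

```python
import random
from collections import defaultdict


def kmeans_cost(points, weights, centers):
    """Weighted k-means cost: sum of w * squared distance to the nearest center."""
    total = 0.0
    for p, w in zip(points, weights):
        total += w * min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


def grid_coreset(points, cell):
    """Illustrative stand-in, NOT the paper's construction.

    Buckets points into a uniform grid of side `cell`, keeps the first input
    point seen in each occupied cell, and weights it by the cell's point count.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell) for x in p)
        cells[key].append(p)
    reps = [members[0] for members in cells.values()]
    weights = [len(members) for members in cells.values()]
    return reps, weights


# Synthetic data (an assumption for the demo): three Gaussian blobs in the plane.
random.seed(0)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
P = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
     for cx, cy in blob_centers for _ in range(2000)]

S, w = grid_coreset(P, cell=0.1)

# Evaluate the same candidate centers on the full set and on the weighted subset.
full_cost = kmeans_cost(P, [1] * len(P), blob_centers)
small_cost = kmeans_cost(S, w, blob_centers)
rel_err = abs(full_cost - small_cost) / full_cost
print(len(P), len(S), rel_err)
```

The weights sum to |P|, so the weighted cost on S is an estimate of the cost on P for any candidate set of centers. A real coreset construction additionally bounds the relative error by ε for every choice of k centers, which this sketch does not attempt.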

Similar resources

Distributed Balanced Clustering via Mapping Coresets

Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. For many applications, we face explicit or implicit size constraints for each cluster which leads to the problem of clustering under capacity constraints or the “balanced clustering” problem. Although the balanced clustering problem has been widely studied, developing a theoretically sound di...

On the Sensitivity of Shape Fitting Problems

In this article, we study shape fitting problems, ε-coresets, and total sensitivity. We focus on the (j, k)-projective clustering problems, including k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem. We derive upper bounds of total sensitivities for these problems, and obtain ε-coresets using these upper bounds. Using a dimension-...

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained from the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. The question now arises: can those results be extended to other languages? Since the goal of document clustering is grouping of docum...

Algorithms for the Bregman k-Median problem

In this thesis, we study the k-median problem with respect to a dissimilarity measure D_φ from the family of Bregman divergences: Given a finite set P of size n from R^d, our goal is to find a set C of size k such that the sum of errors cost(P, C) = ∑_{p∈P} min_{c∈C} D_φ(p, c) is minimized. This problem plays an important role in applications from many different areas of computer science, such as info...
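
The cost function in this abstract can be evaluated directly. The Python sketch below is only an illustration of the formula, not code from the thesis: it assumes two standard Bregman divergences (squared Euclidean distance and the generalized Kullback-Leibler divergence), and the names `bregman_cost`, `squared_euclidean`, and `kl_divergence`, as well as the sample points, are hypothetical.

```python
import math


def squared_euclidean(p, c):
    """Bregman divergence generated by phi(x) = ||x||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kl_divergence(p, c):
    """Generalized KL divergence, generated by phi(x) = sum_i x_i log x_i.

    Assumes all coordinates are strictly positive.
    """
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, c))


def bregman_cost(P, C, divergence):
    """cost(P, C): each point pays its divergence to the cheapest center in C."""
    return sum(min(divergence(p, c) for c in C) for p in P)


# Hypothetical sample data: three points and two candidate centers.
P = [(0.2, 0.8), (0.25, 0.75), (0.9, 0.1)]
C = [(0.2, 0.8), (0.9, 0.1)]

cost_sq = bregman_cost(P, C, squared_euclidean)
cost_kl = bregman_cost(P, C, kl_divergence)
print(cost_sq, cost_kl)
```

Note that the divergence is called as D_φ(p, c), with the data point first and the center second, matching the order in the formula; Bregman divergences are generally not symmetric, so the order matters.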

Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε) factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...

Publication date: 2003